Abstract. We describe a technique for mechanically proving certain
kinds of theorems in combinatorics on words, using automata and a
package for manipulating them. We illustrate our technique by solving,
purely mechanically, an open problem of Currie and Saari on the lengths
of unbordered factors in the Thue-Morse sequence.

1 Introduction



The title of this paper is a bit of a pun. On the one hand, we are concerned
with certain natural questions about automatic sequences: sequences over a finite
alphabet where the n'th term is expressible as a finite-state function of the
base-k representation of n. On the other hand, we are interested in answering
these questions purely mechanically, in an automated fashion.

Let $x = (a(n))_{n \geq 0}$ be an infinite sequence over a finite alphabet
$\Delta$. Then x is said to be k-automatic if there is a deterministic finite
automaton M taking as input the base-k representation of n, and having $a(n)$
as the output associated with the last state encountered [3]. In this case, we
say that M generates the sequence x.

For example, in Figure 1 we give an automaton generating the well-known
Thue-Morse sequence $t = t(0)t(1)t(2)\cdots = 011010011001\cdots$ [!]. The
input is n, expressed in base 2, and the output is the number contained in the
state last reached. Thus $t(n)$ is the sum, modulo 2, of the binary digits of n.




Fig. 1. A finite automaton generating the Thue-Morse sequence 



For at least 25 years, researchers have been interested in the algorithmic 
decidability of assertions about automatic sequences. For example, in one of the 
earliest results, Honkala TT showed that, given an automaton, it is decidable if 
the sequence it generates is ultimately periodic. 

Recently, Allouche et al. [1] found a different proof of Honkala's result using a
more general technique. Using this technique, they were able to give algorithmic 
solutions to many classical problems from combinatorics on words such as 

Given an automaton, is the generated sequence squarefree? Or overlapfree? 

We write $x[i] = a(i)$, and we let $x[i..i+n-1]$ denote the factor of length
n beginning at position i in x. A sequence is said to be squarefree if it contains
no factor of the form xx, where x is a nonempty word, and is said to be overlapfree
if it contains no factor of the form ayaya, where a is a single letter and y is a
possibly empty word.
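
The two definitions can be checked by brute force on a finite prefix. The following is a minimal sketch for illustration only; the function names are our own, and a finite check of this kind is evidence, not a proof, for the infinite sequence.

```python
def thue_morse(n):
    """t(n): the sum, modulo 2, of the binary digits of n."""
    return bin(n).count("1") % 2


def has_square(word):
    """True if word contains a factor xx with x nonempty."""
    n = len(word)
    return any(word[i:i + p] == word[i + p:i + 2 * p]
               for p in range(1, n // 2 + 1)
               for i in range(n - 2 * p + 1))


def has_overlap(word):
    """True if word contains a factor ayaya (a a letter, y possibly empty).

    Such a factor has length 2p + 1 and period p = |ay|, so its first
    p + 1 letters equal its last p + 1 letters.
    """
    n = len(word)
    return any(word[i:i + p + 1] == word[i + p:i + 2 * p + 1]
               for p in range(1, n // 2 + 1)
               for i in range(n - 2 * p))


prefix = "".join(str(thue_morse(i)) for i in range(256))
print(has_square(prefix), has_overlap(prefix))
```

On this prefix the Thue-Morse sequence contains squares (for example 11) but no overlaps, in line with Thue's theorem.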

The technique of Allouche et al. is, at its core, very similar to work of Büchi,
Bruyère, Michaux, Villemaire, and others, involving formal logic; see, e.g., [5].
The basic idea is as follows: given the automaton M, and some predicate $P(n)$ we
want to check, we alter M by a series of transformations to a new automaton $M'$
that accepts the base-k representations of those integers n for which $P(n)$ is true.
Then we can check the assertion "$\exists n\ P(n)$" simply by checking if $M'$ accepts
anything (which can be done by a standard depth-first search on the underlying
directed graph of the automaton). We can check the assertion "$\forall n\ P(n)$" by
checking if $M'$ accepts everything. And we can check assertions like "$P(n)$ holds
for infinitely many n" by checking if $M'$ has a reachable cycle from which a final
state is reachable.
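
The three checks reduce to reachability questions on the transition graph. Here is a minimal sketch, assuming a complete DFA stored as a dictionary `delta[state][symbol]` (a hypothetical representation chosen for this illustration, not the format of the package used in this paper). Note that the third test counts accepted words, so it corresponds to infinitely many integers only when padded representations are accounted for.

```python
from collections import deque


def reachable(delta, sources):
    """States reachable from `sources` by following transitions (BFS)."""
    seen, queue = set(sources), deque(sources)
    while queue:
        q = queue.popleft()
        for r in delta[q].values():
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return seen


def accepts_something(delta, start, finals):
    """Some final state is reachable from the start state."""
    return bool(reachable(delta, [start]) & finals)


def accepts_everything(delta, start, finals):
    """Every reachable state is final (the DFA must be complete)."""
    return reachable(delta, [start]) <= finals


def accepts_infinitely_many(delta, start, finals):
    """Some reachable state lies on a cycle and can reach a final state."""
    for q in reachable(delta, [start]):
        on_cycle = q in reachable(delta, set(delta[q].values()))
        if on_cycle and reachable(delta, [q]) & finals:
            return True
    return False


# Example: binary words with an even number of 1s.
even_ones = {0: {"0": 0, "1": 1}, 1: {"0": 1, "1": 0}}
# Example: accepts only the empty word.
only_empty = {"a": {"0": "dead", "1": "dead"},
              "dead": {"0": "dead", "1": "dead"}}
```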

Using this idea, Allouche et al. were able to reprove, purely mechanically
using a computer program, the classic theorem of Thue [24,25,4] that the
Thue-Morse sequence t is overlapfree.

More recently, the technique has been applied to give decision procedures for 
other properties of automatic sequences. For example, Charlier et al. [6] showed 
that it can be used to decide if a given k-automatic sequence

— contains powers of arbitrarily large exponent; 

— is recurrent; 

— is uniformly recurrent. 

A sequence is said to be recurrent if every factor that occurs, occurs infinitely
often. A sequence x is said to be uniformly recurrent if it is recurrent and
furthermore for each finite factor w occurring in x, there is a constant $c(w)$ such
that two consecutive occurrences of w are separated by at most $c(w)$ positions.

More recently, variations of the technique have been used to 



— compute the critical exponent; 

— compute the initial critical exponent; 

— decide if a sequence is linearly recurrent; 

— compute the Diophantine exponent. 

(For definitions of these terms see [22].) 



2 The decision procedure 

In [5] we have the following theorem: 

Theorem 1. If we can express a property of a k-automatic sequence x using
quantifiers, logical operations, integer variables, the operations of addition,
subtraction, indexing into x, and comparison of integers or elements of x, then this
property is algorithmically decidable.

Let us outline how the decision procedure works. 

First, the input to the decision procedure: an automaton
$M = (Q, \Sigma_k, \Delta, \delta, q_0, \tau)$
generating the k-automatic sequence x. Here

— $Q$ is a nonempty set of states;

— $\Sigma_k := \{0, 1, \ldots, k-1\}$;

— $\Delta$ is the output alphabet;

— $\delta : Q \times \Sigma_k \rightarrow Q$ is the transition function;

— $q_0$ is the initial state; and

— $\tau : Q \rightarrow \Delta$ is the output mapping.

In this paper, we assume that the automaton takes as input the representation
of n in base k, starting with the least significant digit; we call this the
reversed representation of n and write it as $(n)_k$. We allow leading zeroes in the
representation (which, because of our convention, are actually trailing zeroes).
Thus, for example, 011 and 01100 are both acceptable representations for 6 in
base 2.
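
To make the conventions concrete, here is a minimal sketch of such an automaton with output, reading the least significant digit first. The class name `DFAO` and its dictionary layout are assumptions made for this illustration; the transition table is the two-state Thue-Morse automaton described in the introduction.

```python
def reversed_digits(n, k):
    """Digits of n in base k, least significant first (empty for n = 0)."""
    digits = []
    while n > 0:
        digits.append(n % k)
        n //= k
    return digits


class DFAO:
    """Deterministic finite automaton with output (Q, Sigma_k, Delta, delta, q0, tau)."""

    def __init__(self, k, delta, start, output):
        self.k, self.delta, self.start, self.output = k, delta, start, output

    def run(self, digits):
        """Output attached to the state reached after reading `digits`."""
        q = self.start
        for d in digits:
            q = self.delta[q][d]
        return self.output[q]

    def __call__(self, n):
        return self.run(reversed_digits(n, self.k))


# Thue-Morse: the state records the parity of the number of 1 digits read.
THUE_MORSE = DFAO(k=2,
                  delta={0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}},
                  start=0,
                  output={0: 0, 1: 1})

print([THUE_MORSE(n) for n in range(12)])
print(THUE_MORSE.run([0, 1, 1]), THUE_MORSE.run([0, 1, 1, 0, 0]))
```

The last line feeds the two representations 011 and 01100 of 6 and obtains the same output, as the convention on trailing zeroes requires.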

We might also need to encode pairs, triples, or r-tuples of integers. We handle
these by first padding the reversed representation of the smaller integer with
trailing zeroes, and then coding the r-tuple as a word over $\Sigma_k^r$. For example,
the pair (20, 13) could be represented in base 2 as

[0,1] [0,0] [1,1] [0,1] [1,0], 

where the first components spell out 00101 and the second components spell out 
10110. Of course, there are other possible representations, such as 

[0,1] [0,0] [1,1] [0,1] [1,0] [0,0], 

which correspond to non-canonical representations having trailing zeroes; these 
are also permitted. 
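
The encoding of tuples can be sketched as follows. The helper names `encode_tuple` and `decode_tuple` are hypothetical, introduced only for this illustration.

```python
def reversed_digits(n, k):
    """Digits of n in base k, least significant first (empty for n = 0)."""
    digits = []
    while n > 0:
        digits.append(n % k)
        n //= k
    return digits


def encode_tuple(values, k, extra_zeros=0):
    """Code an r-tuple of integers as a word over r-tuples of base-k digits.

    Shorter representations are padded with trailing zeroes; `extra_zeros`
    appends further all-zero symbols, giving a non-canonical representation.
    """
    columns = [reversed_digits(v, k) for v in values]
    length = max(len(c) for c in columns) + extra_zeros
    padded = [c + [0] * (length - len(c)) for c in columns]
    return [tuple(c[i] for c in padded) for i in range(length)]


def decode_tuple(word, k):
    """Inverse of encode_tuple; trailing all-zero symbols are harmless."""
    r = len(word[0]) if word else 0
    return tuple(sum(sym[c] * k ** i for i, sym in enumerate(word))
                 for c in range(r))


print(encode_tuple((20, 13), 2))
print(encode_tuple((20, 13), 2, extra_zeros=1))
```

The two printed words are the two representations of (20, 13) displayed above.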

Rather than present a detailed proof, we illustrate the idea of the decision 
procedure in the proof of the following new result: 

Theorem 2. The following problem is algorithmically decidable: given two
k-automatic sequences x and y, generated by automata $M_1$ and $M_2$, respectively,
decide if x is a shift of y (that is, decide if there exists a constant c such that
$x[n] = y[n + c]$ for all $n \geq 0$).



Proof. We first create an NFA M that accepts the language

$$\{(c)_k : \exists n \text{ such that } x[n] \neq y[n + c]\}.$$

To do so, on input $(c)_k$, M

— guesses $w_1 = (n)_k$ nondeterministically (perhaps with trailing zeroes
appended),

— simulates $M_1$ on $w_1$,

— adds n to c and computes the base-k representation $w_2 = (n + c)_k$
digit-by-digit "on the fly", keeping track of carries, as necessary, and simulates
$M_2$ on $w_2$, and

— accepts if the outputs of the two machines differ.

We now convert M to a DFA $M'$, and change final states to non-final (and
vice versa). Then $M'$ accepts the language

$$\{(c)_k : x[n] = y[n + c] \text{ for all } n \geq 0\}.$$

Thus, x is a shift of y if and only if $M'$ accepts any word, which is easily checked
through depth-first search. □
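
The "guess n, add on the fly, compare outputs" step can be sketched for a single fixed c. This is a simplification made for illustration: the proof builds one automaton over all c, whereas the sketch below explores the states (state of $M_1$, state of $M_2$, carry, position in c) for one given c. The triple format `(delta, start, output)` is an assumption of this sketch, and both automata are assumed to give the correct output on representations with trailing zeroes.

```python
from collections import deque


def is_shift_by(m1, m2, c, k=2):
    """Decide whether x[n] == y[n + c] for all n >= 0.

    m1 and m2 are (delta, start, output) triples for automata reading
    base-k digits least significant first.
    """
    delta1, start1, out1 = m1
    delta2, start2, out2 = m2
    c_digits = []
    while c > 0:
        c_digits.append(c % k)
        c //= k
    length = len(c_digits)

    start = (start1, start2, 0, 0)
    seen, queue = {start}, deque([start])
    while queue:
        q1, q2, carry, pos = queue.popleft()
        # Once c is fully read and no carry is pending, q1 and q2 are the
        # states for some n and for n + c respectively.
        if pos == length and carry == 0 and out1[q1] != out2[q2]:
            return False
        c_digit = c_digits[pos] if pos < length else 0
        for n_digit in range(k):  # nondeterministic guess of the next digit of n
            total = n_digit + c_digit + carry
            nxt = (delta1[q1][n_digit], delta2[q2][total % k],
                   total // k, min(pos + 1, length))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Thue-Morse automaton, and an automaton for the sequence n mod 2.
tm = ({0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}, 0, {0: 0, 1: 1})
parity = ({"s": {0: "e", 1: "o"}, "e": {0: "e", 1: "e"}, "o": {0: "o", 1: "o"}},
          "s", {"s": 0, "e": 0, "o": 1})
```

For instance, `is_shift_by(parity, parity, 2)` holds because n mod 2 has period 2, while `is_shift_by(tm, tm, 1)` fails already at n = 0.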

Remark 1. As we can see, the size of the automata involved depends, in an
unpleasant way, on the number of quantifiers needed to state the logical expression
characterizing the property being checked, because existential quantifiers are
implemented through nondeterminism, and universal quantifiers are implemented
through nondeterminism and complementation (which is implemented in a DFA
by exchanging the roles of final and non-final states). Thus each new quantifier
could increase the current number of states, say n, to $2^n$ using the subset
construction. If the original automata have at most N states, it follows that the
running time is bounded by an expression of the form

$$2^{2^{\cdot^{\cdot^{\cdot^{2^{p(N)}}}}}}$$

where p is a polynomial and the number of exponents in the tower is one less
than the number of quantifiers in the logical formula characterizing the property
being checked.

This extraordinary computational complexity raises the natural question of 
whether the decision procedure could actually be implemented for anything but 
toy examples. Luckily the answer seems to be yes — at least in some cases — 
as we will see below. 
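
The source of the blow-up can be seen in a small sketch of the subset construction followed by complementation. The representation (a dictionary from (state, symbol) to sets of states) and the example NFA are assumptions chosen for illustration; the example is the textbook language "the third symbol from the end is 1", whose determinization needs $2^3$ states.

```python
def determinize(alphabet, delta, starts, finals):
    """Subset construction.

    delta maps (state, symbol) to a set of successor states; a missing key
    means "no transition". Returns (states, dfa_delta, start, dfa_finals),
    where every DFA state is a frozenset of NFA states.
    """
    start = frozenset(starts)
    states, todo, dfa_delta = {start}, [start], {}
    while todo:
        subset = todo.pop()
        for a in alphabet:
            target = frozenset(r for q in subset for r in delta.get((q, a), ()))
            dfa_delta[(subset, a)] = target
            if target not in states:
                states.add(target)
                todo.append(target)
    dfa_finals = {s for s in states if s & set(finals)}
    return states, dfa_delta, start, dfa_finals


def complement_finals(states, dfa_finals):
    """Complement a complete DFA by exchanging final and non-final states."""
    return states - dfa_finals


def accepts(dfa_delta, start, finals, word):
    q = start
    for a in word:
        q = dfa_delta[(q, a)]
    return q in finals


# Hypothetical example NFA: the K-th symbol from the end is 1.
K = 3
nfa_delta = {(0, "0"): {0}, (0, "1"): {0, 1}}
for i in range(1, K):
    nfa_delta[(i, "0")] = nfa_delta[(i, "1")] = {i + 1}

states, dfa_delta, start, finals = determinize("01", nfa_delta, {0}, {K})
co_finals = complement_finals(states, finals)
print(len(states))
```

The NFA has K + 1 states, while the DFA has $2^K$; applying this once per quantifier alternation yields the tower of exponents in Remark 1.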



3 Borders 

A word w is bordered if it begins and ends with the same word x with
$0 < |x| \leq |w|/2$; otherwise it is unbordered. An example in English of a bordered
word is entanglement. A bordered word is also called bifix in the literature, and
unbordered words are also called bifix-free or primary.
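
The definition translates directly into a few lines of code. This is a minimal sketch with function names of our own choosing.

```python
def border_lengths(w):
    """All l with 0 < l <= |w|/2 such that w begins and ends with the same word of length l."""
    return [l for l in range(1, len(w) // 2 + 1) if w[:l] == w[-l:]]


def is_bordered(w):
    return bool(border_lengths(w))


print(border_lengths("entanglement"))
print(is_bordered("0011"))
```

The word entanglement has the single border ent of length 3, while 0011 is unbordered.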

Bordered and unbordered words have been actively studied in the literature, 
particularly with regard to the Ehrenfeucht-Silberger problem; see, for example, 
il3ll8ll0lllll4ll5l7ll6ll9TT2] . just to name a few. 

Currie and Saari fF studied the unbordered factors of the Thue-Morse
sequence t. They proved that if $n \not\equiv 1 \pmod 6$, then t has an unbordered factor
of length n. (Also see [U] Lemma 4.10 and Problem 4.1].) However, this is not
a necessary condition, as

t[39..69] = 0011010010110100110010110100101, 

which is an unbordered factor of length 31. Currie and Saari left it as an open 
problem to give a complete characterization of the integers n for which t has an 
unbordered factor of length n. 

The following theorem and proof, quoted practically verbatim from [6], shows
that, more generally, the characteristic sequence of n for which a given
k-automatic sequence has an unbordered factor of length n, is itself k-automatic:

Theorem 3. Let $x = a(0)a(1)a(2)\cdots$ be a k-automatic sequence. Then the
associated infinite sequence $b = b(0)b(1)b(2)\cdots$ defined by

$$b(n) = \begin{cases} 1, & \text{if } x \text{ has an unbordered factor of length } n; \\ 0, & \text{otherwise;} \end{cases}$$

is k-automatic.

Proof. The sequence x has an unbordered factor of length n

iff

$\exists j \geq 0$ such that the factor of length n beginning at position j of x is
unbordered

iff

there exists an integer $j \geq 0$ such that for all possible lengths l with
$1 \leq l \leq n/2$, there is an integer i with $0 \leq i < l$ such that the supposed
border of length l beginning and ending the factor of length n beginning at
position j of x actually differs in the i'th position

iff

there exists an integer $j \geq 0$ such that for all integers l with
$1 \leq l \leq n/2$ there exists an integer i with $0 \leq i < l$ such that
$x[j + i] \neq x[j + n - l + i]$.

Now assume x is a k-automatic sequence, generated by some finite automaton.
We show how to implement the characterization given above with an automaton.

We first create an NFA that, given $(j, l, n)_k$, guesses the base-k representation
of i, digit-by-digit, checks that $i < l$, computes $j + i$ and $j + n - l + i$ on
the fly, and checks that $x[j + i] \neq x[j + n - l + i]$. If such an i is found, it
accepts. We then convert this to a DFA, and interchange accepting and nonaccepting
states. This DFA $M_1$ accepts $(j, l, n)_k$ such that there is no i, $0 \leq i < l$, such
that $x[j + i] \neq x[j + n - l + i]$. We then use $M_1$ as a subroutine to build an NFA
$M_2$ that on input $(j, n)_k$ guesses l, checks that $1 \leq l \leq n/2$, and calls $M_1$
on the result. We convert this to a DFA and interchange accepting and nonaccepting
states to get $M_3$. Finally, this $M_3$ is used as a subroutine to build an NFA $M_4$
that on input n guesses j and calls $M_3$.

The characteristic sequence of these integers n is therefore k-automatic. □
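
The quantified characterization used in the proof can be evaluated naively on a finite window, which is useful as a sanity check. The sketch below is not the automaton construction: it replaces "there exists $j \geq 0$" by a search over $j$ below a bound `j_bound` that we choose by hand, so a negative answer is only as reliable as that bound.

```python
def thue_morse(n):
    """t(n): the sum, modulo 2, of the binary digits of n."""
    return bin(n).count("1") % 2


def has_unbordered_factor(x, n, j_bound):
    """exists j < j_bound, for all l in [1, n/2], exists i < l:
    x(j + i) != x(j + n - l + i)."""
    return any(
        all(any(x(j + i) != x(j + n - l + i) for i in range(l))
            for l in range(1, n // 2 + 1))
        for j in range(j_bound))


missing = [n for n in range(1, 33)
           if not has_unbordered_factor(thue_morse, n, 512)]
print(missing)
print(has_unbordered_factor(thue_morse, 31, 64))
```

For the Thue-Morse sequence the search succeeds for n = 31 (the factor t[39..69] lies inside the window), and the lengths for which it fails within the window are all congruent to 1 modulo 6, as the result of Currie and Saari requires.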

Since the proof is constructive, one can, in principle, carry out the con- 
struction to get an explicit description of the lengths for which the Thue-Morse 
sequence has an unbordered factor. 

Doing so results in the following theorem: 

Theorem 4. There is an unbordered factor of length n in t if and only if the 
base-2 representation of n (starting with the most significant digit) is not of the 
form 1(01*0)*10*1. 

Proof. The proof of this theorem is purely mechanical, and it involves performing
a sequence of operations on finite automata. The second author wrote a program
in C++, using his own automata package, to perform these operations. There
are four stages to the computation, which are described in detail below.

Stage 1

Let T be the automaton of Figure 1 generating the Thue-Morse sequence
t. Stage 1 takes T as input and outputs an automaton $M_1$, where $M_1$ accepts
$w \in (\{0,1\}^4)^*$ if and only if w is the base-2 representation of some
$(n, j, l, i) \in S_1$, where

$$S_1 = \{(n, j, l, i) : 0 < l \leq n/2 \text{ and } i < l \text{ and } t[j + i] \neq t[n + j - l + i]\}. \qquad (1)$$

The size of $M_1$ was only 102 states. However, since the input alphabet for
$M_1$ is of size $2^4 = 16$, a considerable amount of complexity is being stored in the
transition matrix. Stage 1 passed all 1.3 million tests meant to ensure that $M_1$
corresponds to $S_1$.

Stage 2

The purpose of Stage 2 is to remove the variable i by simulating it. The
resulting machine, after being negated, accepts $(n, j, l)$ iff the length n factor of
t starting at index j has a border of length l. So Stage 2 produces the automaton
$M_2$, which is the negation of the result of simulating i. More formally, $M_2$ accepts
a word $w \in (\{0,1\}^3)^*$ if and only if w is the base-2 representation of some
$(n, j, l) \in S_2$, where

$$S_2 = \{(n, j, l) : \nexists\, i \text{ for which } (n, j, l, i) \in S_1\}. \qquad (2)$$

The size of $M_2$ after subset construction was 8689 states, and it minimized
down to 127 states. The output of Stage 2 passed all 1.6 million tests meant to
ensure that $M_2$ corresponds to $S_2$.

Stage 3

The purpose of Stage 3 is to remove l by simulating it. By the end of Stage 3,
most of the work has already been done. The output of Stage 3, $M_3$, accepts an
input word $w \in (\{0,1\}^2)^*$ if and only if w is the base-2 representation of some
$(n, j) \in S_3$, where

$$S_3 = \{(n, j) : \nexists\, l \text{ such that } (n, j, l) \in S_2\} \qquad (3)$$

or, in other words

$$S_3 = \{(n, j) : t \text{ has an unbordered factor of length } n \text{ at index } j\}. \qquad (4)$$

The size of $M_3$ after subset construction was 1987 states, and it minimized
down to 263 states. The output of Stage 3 passed all 1.9 million tests meant to
ensure that $M_3$ corresponds to $S_3$.

Stage 4

Finally, Stage 4 simulates j on $M_3$ and negates the result. So the output
of Stage 4 is an automaton that accepts the binary representation of a positive
integer $n \geq 1$ if and only if the Thue-Morse word has no unbordered factor of
length n. Formally put, the automaton $M_4$ produced by Stage 4 accepts a word
$w \in \{0,1\}^*$ if and only if w is the base-2 representation of some $n \in S_4$, where

$$S_4 = \{n \in \mathbb{N} : n \geq 1,\ \nexists\, j \text{ for which } (n, j) \in S_3\}. \qquad (5)$$

The size of $M_4$ after subset construction is 2734 states, and it minimized to 7
states. $M_4$ accepts the reverse of 1(01*0)*10*1. Therefore the Thue-Morse word
has an unbordered factor of length n if and only if the base-2 representation of
n (starting with the most significant digit) is not of the form 1(01*0)*10*1.

The total computation took 9 seconds of CPU time on a 2.9GHz Dell XPS
laptop. □
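
The regular expression in Theorem 4 is easy to apply directly. The following sketch uses Python's `re` module to list the lengths up to 100 that the theorem excludes; the function name is our own.

```python
import re

# Base-2 representations (most significant digit first) of the lengths n
# for which t has no unbordered factor, according to Theorem 4.
NO_UNBORDERED = re.compile(r"1(01*0)*10*1")


def lacks_unbordered_factor(n):
    return NO_UNBORDERED.fullmatch(format(n, "b")) is not None


excluded = [n for n in range(1, 101) if lacks_unbordered_factor(n)]
print(excluded)
```

Every excluded length is congruent to 1 modulo 6, consistent with the result of Currie and Saari, while n = 31 = 11111 in base 2 is not excluded.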

Remark 2. Here are some additional implementation details.

In order to implement the needed operations on automata, we must decide
on an encoding of elements of $(\Sigma_k^r)^*$. We could do this by performing a perfect
shuffle of each individual word over $\Sigma_k$, or by letting the alphabet itself be
represented by k-tuples. The decision represents a tradeoff between state size
and alphabet size. We used the latter representation, since (a) it makes the
algorithms considerably easier to implement and understand and (b) decreases
the number of states needed.

It was mentioned earlier how many tests were passed in each stage. In order 
to make sure that the final automaton is what we expect, a number of tests are 
run after each stage on the output of that stage. 



For example, let x be an automatic sequence. The testing framework requires
a C++ function which given n computes $x[n]$. Before any operations are done,
the automaton given for x is tested against the C++ function to make sure that
they match for the first 10,000 elements. Then, at each stage before Stage 4
the resulting automaton is tested to give confidence that the operations on the
automata are giving the desired results.

For example, after Stage 2 of computing the set of lengths for which there
exists an unbordered factor of an automatic sequence x, we expect the machine
$M_2$ to accept the language $S_2$, where

$$S_2 = \{(n, j, l) : \nexists\, i \text{ for which } x[j + i] \neq x[n + j - l + i]\}. \qquad (6)$$

This is then tested by making sure $M_2$ accepts $(n, j, l)_k$ if and only if
$(n, j, l) \in S_2$ for all $n, j, l < 1400$. These tests were invaluable to debugging,
and provide confidence in the final result of the computation.
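
The first of these tests, comparing the input automaton against a directly computed reference, can be sketched as follows. The actual framework was written in C++; this Python version, its function name, and its dictionary-based automaton format are assumptions made only to illustrate the idea.

```python
def first_disagreement(delta, start, output, reference, k=2, limit=10_000):
    """Return the least n < limit where the automaton and `reference` differ,
    or None if they agree on the first `limit` elements."""
    for n in range(limit):
        q, m = start, n
        while m > 0:  # feed the base-k digits, least significant first
            q = delta[q][m % k]
            m //= k
        if output[q] != reference(n):
            return n
    return None


def thue_morse(n):
    return bin(n).count("1") % 2


tm_delta = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
print(first_disagreement(tm_delta, 0, {0: 0, 1: 1}, thue_morse))
```

A deliberately wrong output mapping, such as `{0: 0, 1: 0}`, is reported at the first index where it matters.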

Finally, we have to address the issue of multiple representations. It is easy
to forget that automata accept words in $\Sigma_k^*$, and not integers. For some
operations, such as complement and intersection, it is crucial that if one binary
representation is accepted by the automaton, then all binary representations
must be accepted.
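
With the least-significant-digit-first convention, two representations of the same integer differ only in trailing zeroes, so this requirement can be checked state by state. The following is a minimal sketch under that assumption, for a complete DFA stored as `delta[state][symbol]` (a hypothetical format); for tuples, `zero` would be the all-zero symbol.

```python
def respects_trailing_zeros(delta, start, finals, zero=0):
    """True iff, for every word w, w is accepted exactly when w followed by
    `zero` is accepted. It suffices to compare each reachable state with its
    successor on `zero`."""
    seen, todo = {start}, [start]
    while todo:
        q = todo.pop()
        for r in delta[q].values():
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return all((q in finals) == (delta[q][zero] in finals) for q in seen)


# Even number of 1 digits: insensitive to trailing zeroes.
even_ones = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
# Even word length: depends on the representation, not on the integer.
even_length = {0: {0: 1, 1: 1}, 1: {0: 0, 1: 0}}
```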



4 Additional results 

We also applied our decision procedure above to two other famous sequences: 
the Rudin-Shapiro sequence [20,23] and the paperfolding sequence [9].

For a word $w \in 1(0 + 1)^*$, we define $a_w(n)$ to be the number of (possibly
overlapping) occurrences of w in the (ordinary, unreversed) base-2 representation
of n. Thus, for example, $a_{11}(7) = 2$.

The Rudin-Shapiro sequence $r = r(0)r(1)r(2)\cdots$ is then defined to be
$r(n) = (-1)^{a_{11}(n)}$. It is a 2-automatic sequence generated by an automaton of
four states.

The paperfolding sequence $p = p(0)p(1)p(2)\cdots$ is defined as follows: writing
$(n)_2 0^{\omega}$ as $1^i 0 a w$ for some $i \geq 0$, some $a \in \{0,1\}$, and some
$w \in \{0,1\}^*$, we have $p(n) = (-1)^a$. It is a 2-automatic sequence generated by
an automaton of four states.
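
Both definitions can be computed directly from the binary digits of n. The sketch below follows the definitions as stated here, with function names of our own choosing.

```python
def occurrences(w, n):
    """a_w(n): possibly overlapping occurrences of w in the base-2
    representation of n (most significant digit first)."""
    s = format(n, "b")
    return sum(s.startswith(w, i) for i in range(len(s)))


def rudin_shapiro(n):
    """r(n) = (-1)^(a_11(n))."""
    return -1 if occurrences("11", n) % 2 else 1


def paperfolding(n):
    """p(n) = (-1)^a, where the reversed representation of n, padded with
    zeroes, has the form 1^i 0 a w."""
    while n & 1:        # skip the block 1^i of low-order ones
        n >>= 1
    a = (n >> 1) & 1    # skip the 0, then read the digit a
    return -1 if a else 1


print(occurrences("11", 7))
print([rudin_shapiro(n) for n in range(8)])
print([paperfolding(n) for n in range(8)])
```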

Theorem 5. The Rudin-Shapiro sequence has an unbordered factor of every 
length. 

Proof. We applied the same technique discussed previously for the Thue-Morse 
sequence. 

Here is a summary of the computation: 
Stage 1: 269 states 

Stage 2: 85313 states minimized to 1974 
Stage 3: 48488 states minimized to 6465 
Stage 4: 6234 states. 



The Stage 4 NFA has 6234 states. We were unable to determinize this automaton
directly (using two different programs) due to an explosion in the number
of states created. Instead, we reversed the NFA (creating an NFA for $L^R$) and
determinized that. The resulting DFA has 30 states, and upon minimization,
gives a 1-state automaton accepting all strings. □
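
Reversal helps because the subset construction can behave very differently on an NFA and on its reverse, while a language contains all strings exactly when its reversal does. The sketch below illustrates the asymmetry on a textbook example, not on the Rudin-Shapiro automaton; the representation and function names are assumptions made for this illustration.

```python
def reverse_nfa(delta, starts, finals):
    """Reverse every transition and exchange the start and final state sets."""
    rev = {}
    for (q, a), targets in delta.items():
        for r in targets:
            rev.setdefault((r, a), set()).add(q)
    return rev, set(finals), set(starts)


def count_dfa_states(alphabet, delta, starts):
    """Number of subsets reached by the subset construction."""
    start = frozenset(starts)
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        for a in alphabet:
            target = frozenset(r for q in subset for r in delta.get((q, a), ()))
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)


# Hypothetical example NFA: the K-th symbol from the end is 1.
K = 10
nfa = {(0, "0"): {0}, (0, "1"): {0, 1}}
for i in range(1, K):
    nfa[(i, "0")] = nfa[(i, "1")] = {i + 1}

forward = count_dfa_states("01", nfa, {0})
rev_delta, rev_starts, rev_finals = reverse_nfa(nfa, {0}, {K})
backward = count_dfa_states("01", rev_delta, rev_starts)
print(forward, backward)
```

Here direct determinization reaches $2^{10}$ subsets, while the reversed NFA determinizes to only 12.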

Theorem 6. The paperfolding sequence has an unbordered factor of length n if
and only if the reversed representation $(n)_2$ is rejected by the automaton given
in Figure 2.




Fig. 2. A finite automaton for unbordered factors in the paperfolding word



Proof. We applied the same technique discussed previously for the Thue-Morse 
sequence. 

Here is a summary of the computation, which took 6 seconds of CPU time on a
2.9GHz Dell XPS laptop:

Stage 1: 159 states
Stage 2: 1751 states minimized to 89
Stage 3: 178 states minimized to 75
Stage 4: 132 states minimized to 17



5 Further work 

In the future, we plan to extend this work to explicitly compute the number of
distinct unbordered factors of length n in the Thue-Morse sequence. (A conjecture
about this number was given in [6].)

6 Open problems 

Which of the problems mentioned in § 1 are algorithmically decidable for the
more general class of morphic sequences? 

Can the techniques be applied to detect abelian powers in automatic se- 
quences? 